A Variant of K-Means Clustering through Heuristic Initial Seed Selection for Improved Clustering of Data

Authors

  • R. Geetha Ramani
  • Lakshmi Balasubramanian
Abstract

Unsupervised clustering algorithms are used in many applications to group data according to relevant similarity metrics. K-Means is one of the most widely used clustering techniques owing to its simplicity, and many improvements and extensions have been proposed to improve its performance. Of the dimensions explored in this regard, such as mean computation, centroid representation, initial seed (cluster centre) selection and similarity calculation methods, the choice of initial cluster centres has been found to have a profound impact on the performance of the algorithm. Existing methods choose the cluster centres either randomly or based on heuristics such as the maximum distance property, the maximum probability of the squared distance, or the points with the largest number of other points lying close to them. In this paper, a strategy for selecting relevant initial cluster centres for two-cluster grouping problems is proposed, based on measures describing the statistical distribution of the data, with a view to improving clustering accuracy. These measures include the minimum, maximum, median, mean and skew of the data. The algorithm is validated on synthetic datasets and on datasets from the UCI repository, namely Balance, BloodDonate, Diabetes, Ionosphere, Parkinsons and Sonar. The proposed algorithm is compared with K-Means and its variants and is found to achieve better accuracy. An increase in accuracy of approximately 0.25% to 18% is observed across the datasets.
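
The abstract does not state the exact seeding rule, so the following is only an illustrative sketch of how two initial centres could be derived from the minimum, maximum, median, mean and skew of each feature and then passed to a standard two-cluster K-Means loop. The function names (`skewness`, `heuristic_seeds`, `two_means`), the `skew_threshold` parameter and the rule itself (split at the median for strongly skewed features, at the mean otherwise) are assumptions made for illustration, not the authors' method.

```python
import numpy as np

def skewness(X):
    """Per-feature skewness (third standardized moment)."""
    centered = X - X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant features
    return (centered ** 3).mean(axis=0) / std ** 3

def heuristic_seeds(X, skew_threshold=0.5):
    """Hypothetical seeding rule for k = 2 (an assumption, not the paper's rule)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    mean, median = X.mean(axis=0), np.median(X, axis=0)
    # Assumed rule: split at the median when a feature is strongly skewed,
    # at the mean otherwise.
    split = np.where(np.abs(skewness(X)) > skew_threshold, median, mean)
    # One seed halfway between the minimum and the split point,
    # the other halfway between the split point and the maximum.
    return np.vstack([(lo + split) / 2.0, (split + hi) / 2.0])

def two_means(X, seeds, max_iter=100):
    """Standard Lloyd iterations for two clusters, started from the given seeds."""
    centres = seeds.astype(float).copy()
    for _ in range(max_iter):
        # Distance from every point to both centres, shape (n_samples, 2).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centres = centres.copy()
        for k in range(2):
            if np.any(labels == k):  # keep the old centre if a cluster is empty
                new_centres[k] = X[labels == k].mean(axis=0)
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Small deterministic example with two well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centres = two_means(X, heuristic_seeds(X))
print(labels)
print(centres)
```

Because the seeds are computed deterministically from the data, repeated runs give the same clustering, in contrast to random initialization.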


Similar resources

Data Clustring Using A New CGA(Chaotic-Generic Algorithm) Approach

Clustering is the process of dividing a set of input data into a number of subgroups. The members of each subgroup are similar to each other but different from members of other subgroups. The genetic algorithm has enjoyed many applications in clustering data. One of these applications is the clustering of images. The problem with the earlier methods used in clustering images was in selecting in...


Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm

Identifying clusters, or clustering, is an important aspect of data analysis. It is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other in some sense. It is a main task of exploratory data mining and a common technique for statistical data analysis. This paper proposes an improved version of the K-Means algorithm, namely Persistent K...


Use of the Improved Frog-Leaping Algorithm in Data Clustering

Clustering is a well-known data mining technique in which data with similar properties are placed in the same category. The K-means algorithm is one of the simplest clustering algorithms, but it has the disadvantages of being sensitive to the initial values of the clusters and of converging to a local optimum. In recent years, several algorithms based on evolutionary algorithms have been proposed for cluster...


GROUND MOTION CLUSTERING BY A HYBRID K-MEANS AND COLLIDING BODIES OPTIMIZATION

The stochastic nature of earthquakes makes it challenging for engineers to choose which records to use in their analyses. Clustering is offered as a solution to this data mining problem, automatically distinguishing between ground motion records based on similarities in the corresponding seismic attributes. The present work formulates an optimization problem to seek the best clustering measures. In...




Publication date: 2016